NSF PAR Search | NSF Public Access Repository


Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Draesslerova, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Benjamin; Manzini, Giovanni; Navarro, Gonzalo (July 2024, SEA 2024)

Full Text Available
Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

https://doi.org/10.4230/LIPIcs.SEA.2024.10

Draesslerová, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Liberti, Leo (Ed.)
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
- build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.
With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
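The four-step example in this abstract (index the concatenation, find a MEM's suffix-array interval, take the minimum and maximum positions, return the LCA) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a naively built suffix array in place of an augmented FM-index, a slice in place of range-minimum/maximum queries, and a hypothetical three-genome tree described by a `parent` map. All function and variable names are illustrative.

```python
from bisect import bisect_right

def suffix_array(text):
    # Naive construction; acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_interval(text, sa, pattern):
    # Half-open range [lo, hi) of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def lca(parent, a, b):
    # Walk up from a, then climb from b until the paths meet.
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b

def classify(pattern, genomes, names, parent):
    # Concatenate genomes in left-to-right leaf order, '$'-separated.
    text = "$".join(genomes) + "$"
    starts, pos = [], 0
    for g in genomes:
        starts.append(pos)
        pos += len(g) + 1
    sa = suffix_array(text)
    lo, hi = sa_interval(text, sa, pattern)
    if lo == hi:
        return None                     # pattern does not occur
    positions = sa[lo:hi]
    first = names[bisect_right(starts, min(positions)) - 1]
    last = names[bisect_right(starts, max(positions)) - 1]
    return lca(parent, first, last)

# Hypothetical tree: root -> (n1 -> g1, g2), g3
genomes = ["ACGTAC", "ACGTTT", "GGGTTT"]
names = ["g1", "g2", "g3"]
parent = {"g1": "n1", "g2": "n1", "n1": "root", "g3": "root", "root": None}
print(classify("ACGT", genomes, names, parent))  # n1
print(classify("GTTT", genomes, names, parent))  # root
```

Because the genomes are concatenated in leaf order, the leftmost and rightmost occurrences are enough: every other occurrence lies in a genome between them, hence under the same LCA.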
Full Text Available
Compressing and Indexing Aligned Readsets

https://doi.org/10.4230/LIPIcs.WABI.2021.13

Gagie, Travis; Gourdel, Garance; Manzini, Giovanni (January 2021, Workshop on Algorithms in Bioinformatics (WABI))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
Compressed full-text indexes are one of the main success stories of bioinformatics data structures, but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads’ starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of the readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
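The comparison metric mentioned in this abstract, the number of runs in the transformed string, can be illustrated for the plain single-string BWT. The sketch below is only an assumption-laden toy: it builds the BWT by sorting rotations (not how large indexes are built), it does not implement the EBWT or XBWT, and it assumes the sentinel `$` is absent from the input and sorts before every other character.

```python
from itertools import groupby

def bwt(text, sentinel="$"):
    # Sort all rotations of text + sentinel and read off the last column.
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    # A run is a maximal block of equal consecutive characters.
    return sum(1 for _ in groupby(s))

transformed = bwt("banana")
print(transformed)              # annb$aa
print(count_runs(transformed))  # 5
```

Fewer runs generally means a smaller run-length-compressed index, which is why run counts serve as a proxy for index size on repetitive data.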
Full Text Available
Efficiently Merging r-indexes

https://doi.org/10.1109/DCC50243.2021.00028

Oliva, Marco; Rossi, Massimiliano; Siren, Jouni; Manzini, Giovanni; Kahveci, Tamer; Gagie, Travis; Boucher, Christina (March 2021, 2021 Data Compression Conference (DCC))
Large sequencing projects, such as GenomeTrakr and MetaSub, are updated frequently (sometimes daily, in the case of GenomeTrakr) with new data. Therefore, it is imperative that any data structure indexing such data supports efficient updates. Toward this goal, Bannai et al. (TCS, 2020) proposed a data structure named the dynamic r-index, which is suitable for large genome collections and supports incremental construction; however, it is still not powerful enough to support substantial updates. Here, we develop a novel algorithm for updating the r-index, which we refer to as RIMERGE. Fundamental to our algorithm is the combination of the basics of the dynamic r-index with a known algorithm for merging Burrows-Wheeler Transforms (BWTs). As a result, RIMERGE is capable of performing batch updates in a manner that exploits parallelism while keeping the memory overhead small. We compare our method to the dynamic r-index of Bannai et al. using two different datasets, and show that RIMERGE is between 1.88 and 5.34 times faster on reasonably large inputs.
Full Text Available
PHONI: Streamed Matching Statistics with Multi-Genome References

https://doi.org/10.1109/DCC50243.2021.00027

Boucher, Christina; Gagie, Travis; I, Tomohiro; Koppl, Dominik; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo; Pacheco, Alejandro; Rossi, Massimiliano (March 2021, Data Compression Conference)

Full Text Available
Practical Random Access to SLP-Compressed Texts

https://doi.org/10.1007/978-3-030-59212-7_16

Gagie, Travis; I, Tomohiro; Manzini, Giovanni; Navarro, Gonzalo; Sakamoto, Hiroshi; Seelbach Benkner, Louisa; Takabatake, Yoshimasa (January 2020, SPIRE 2020)
Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.
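To make the random-access primitive concrete, here is a minimal sketch of the textbook approach on a straight-line program (SLP): store each nonterminal's expansion length, then descend from the start symbol, going left or right by comparing the target position with the left child's length. This is not the encoding proposed in the paper; the dictionary representation and the names `rules`, `build_lengths` and `access` are assumptions made for illustration.

```python
def build_lengths(rules):
    # rules maps a symbol to a single character (terminal rule)
    # or to a pair (left, right) of symbols (binary rule).
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths

def access(rules, lengths, sym, i):
    # Return character i of the expansion of sym without decompressing it.
    if not 0 <= i < lengths[sym]:
        raise IndexError(i)
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left                  # target lies in the left child
        else:
            i -= lengths[left]          # skip the left child's expansion
            sym = right

# Toy grammar: Z -> Y A, Y -> X X, X -> A B, A -> 'a', B -> 'b'
rules = {"A": "a", "B": "b", "X": ("A", "B"), "Y": ("X", "X"), "Z": ("Y", "A")}
lengths = build_lengths(rules)
print("".join(access(rules, lengths, "Z", i) for i in range(lengths["Z"])))  # ababa
```

Each query takes time proportional to the height of the grammar, which is one reason practical encodings and balanced grammars matter for query speed.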
Full Text Available
